Importing required Libraries

In [1]:
import numpy as np
import pandas as pd
import json
import math
import os
import matplotlib.pyplot as plt
import plotly.offline as pyo
import plotly.graph_objs as go
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
In [2]:
path=os.getcwd()
path
Out[2]:
'C:\\Users\\jayam\\Downloads\\7879phd'

Reading the final merged train data

In [3]:
train=pd.read_csv(path +"\\train_finaldata.csv", sep=",", na_values=["?",",","#","NaN","unknown",""])

Head & shape of data

In [4]:
train.head()
Out[4]:
year month day city medicine date sales discounted footfall year_month city_medicine week
0 2015 1 2 1 1 2015-01-02 24.0 NaN 13948.0 2015_01 1_1 1
1 2015 1 2 1 2 2015-01-02 144.0 NaN 13948.0 2015_01 1_2 1
2 2015 1 2 1 3 2015-01-02 84.0 NaN 13948.0 2015_01 1_3 1
3 2015 1 2 1 4 2015-01-02 24.0 NaN 13948.0 2015_01 1_4 1
4 2015 1 2 1 5 2015-01-02 8.0 NaN 13948.0 2015_01 1_5 1
In [5]:
train.shape
Out[5]:
(22601741, 12)

Converting to date format

In [6]:
train["date"]=pd.to_datetime(train["date"])

Reading the city JSON file and reshaping it as required for the graph below

In [7]:
with open(path+"\\city_dict.json") as f:
    data = json.load(f)
# data maps city id (string) -> city name; build an int-keyed lookup
city = {int(k): v for k, v in data.items()}

Let us see the top cities contributing to sales

In [8]:
import plotly.express as px
city_sales=train.groupby(["city"])["sales"].sum().sort_values(ascending=True).reset_index()
city_sales["sales"]=np.round(city_sales["sales"])
city_sales["city"]=city_sales["city"].map(city)
fig5 = px.bar(city_sales, x="city", y="sales",  barmode='group',title="Aggregated sales for each City")
fig5.show()
  • From the graph above we can see that the top 4 cities contributing to sales are Mumbai, Delhi, Kolkata, and Bangalore

Mumbai

Let us see over all sales for each medicine in Mumbai
In [9]:
mumbai=train[train["city"]==4]
In [10]:
plt.figure(figsize=(5,5))
mumbai.groupby(["medicine"])["sales"].mean().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Mumbai")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Mumbai is medicine_type 2103, followed by 2728 and 2051

Let us check the top 10 contribution of medicine types to sales for every year in Mumbai

Mumbai 2015

In [11]:
mumbai[mumbai["year"]==2015].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Mumbai in the year 2015")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Mumbai in 2015 is medicine_type 1077, followed by 1083 and 1062

Mumbai 2016

In [12]:
mumbai[mumbai["year"]==2016].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Mumbai in the year 2016")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Mumbai in 2016 is medicine_type 1077, followed by 2103 and 2051

Mumbai 2017

In [13]:
mumbai[mumbai["year"]==2017].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Mumbai in the year 2017")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Mumbai in 2017 is medicine_type 1077, followed by 2103 and 2728

Mumbai 2018

In [14]:
mumbai[mumbai["year"]==2018].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Mumbai in the year 2018")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Mumbai in 2018 is medicine_type 2103, followed by 1077 and 2728
  • From this analysis, largely the same set of medicines contributes to Mumbai's sales every year

Sales of Mumbai aggregated at year

In [15]:
mumbai.groupby(["year"])["sales"].sum().plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("sales of medicines in Mumbai")
plt.ylabel("sales")
plt.show()
  • We can clearly see that sales increased every year; 2018 shows a lower number only because it covers just 6 months

Let us see how the discount percentage varies across the 3 years in Mumbai

In [16]:
from pandas import DataFrame
import plotly.express as px
train1 = DataFrame({'percentage_of_sales' : np.round((mumbai.groupby( [ "year",'discounted'] ).size()/mumbai.groupby( [ "year",'discounted'] ).size().sum())*100,2)}).reset_index()
fig5 = px.bar(train1, x="year", y='percentage_of_sales', color='discounted',labels='percentage_of_sales',text='percentage_of_sales',title="Share of discounted vs non-discounted sales in each year in Mumbai")
fig5.show()
  • We can see a gradual increase in the percentage of discounted sales

Delhi

In [17]:
delhi=train[train["city"]==5]

Delhi 2015

In [18]:
delhi[delhi["year"]==2015].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Delhi in the year 2015")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Delhi in 2015 is medicine_type 574, followed by 1077 and 316

Delhi 2016

In [19]:
delhi[delhi["year"]==2016].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Delhi in the year 2016")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Delhi in 2016 is medicine_type 2103, followed by 2051 and 574

Delhi 2017

In [20]:
delhi[delhi["year"]==2017].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Delhi in the year 2017")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Delhi in 2017 is medicine_type 2103, followed by 2414 and 1971

Delhi 2018

In [21]:
delhi[delhi["year"]==2018].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Delhi in the year 2018")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Delhi in 2018 is medicine_type 2103, followed by 2051 and 2414
  • From this analysis, largely the same set of medicines contributes to Delhi's sales every year

Kolkata 2015

In [22]:
Kolkata=train[train["city"]==1]
Kolkata[Kolkata["year"]==2015].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Kolkata in the year 2015")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Kolkata in 2015 is medicine_type 1077, followed by 316 and 314

Kolkata 2016

In [23]:
Kolkata[Kolkata["year"]==2016].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Kolkata in the year 2016")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Kolkata in 2016 is medicine_type 1077, followed by 2103 and 2051

Kolkata 2017

In [24]:
Kolkata[Kolkata["year"]==2017].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Kolkata in the year 2017")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Kolkata in 2017 is medicine_type 1077, followed by 2103 and 2728

Kolkata 2018

In [25]:
Kolkata[Kolkata["year"]==2018].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Kolkata in the year 2018")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Kolkata in 2018 is medicine_type 2103, followed by 1077 and 2728
  • From this analysis, largely the same set of medicines contributes to Kolkata's sales every year

Bangalore 2015

In [26]:
Bangalore=train[train["city"]==7]
Bangalore[Bangalore["year"]==2015].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Bangalore in the year 2015")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Bangalore in 2015 is medicine_type 768, followed by 574 and 318

Bangalore 2016

In [27]:
Bangalore[Bangalore["year"]==2016].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Bangalore in the year 2016")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Bangalore in 2016 is medicine_type 1517, followed by 2103 and 2051

Bangalore 2017

In [28]:
Bangalore[Bangalore["year"]==2017].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Bangalore in the year 2017")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Bangalore in 2017 is medicine_type 1077, followed by 2103 and 2414

Bangalore 2018

In [29]:
Bangalore[Bangalore["year"]==2018].groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].plot.bar()
plt.xticks(rotation=45, ha='right')
plt.title("Top 10 sold medicine types in Bangalore in the year 2018")
plt.ylabel("sales")
plt.show()
  • From the graph above, the most sold medicine type in Bangalore in 2018 is medicine_type 2103, followed by 2627 and 2051
  • From this analysis, largely the same set of medicines contributes to Bangalore's sales every year

Let us see the sales pattern for all the months in Mumbai

In [30]:
lt=list(mumbai.groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:5].index)     
data=[go.Bar(x=list(mumbai[mumbai["medicine"]==col].groupby(["year_month"])["sales"].sum().index),y=list(mumbai[mumbai["medicine"]==col].groupby(["year_month"])["sales"].sum().values),name="medicine_"+str(col)) for col  in lt]
layout=go.Layout(title="Monthly sales of top 5 medicines in Mumbai",xaxis={"title":"months"},yaxis=dict(title="sales"))
fig=go.Figure(data=data,layout=layout)
fig.show()
  • Medicine types 1083 and 1077 are sold every month in Mumbai and contribute most of the sales, while medicine type 2103 looks like a seasonal medicine as it is sold only in specific months

Let us explore the sales of different medicine types at month level

Let us see the sales pattern for Mumbai in the year 2017

In [31]:
m2017=mumbai[mumbai["year"]==2017]
lt=list(m2017.groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:5].index)     
data=[go.Bar(x=list(m2017[m2017["medicine"]==col].groupby(["month"])["sales"].sum().index),y=list(m2017[m2017["medicine"]==col].groupby(["month"])["sales"].sum().values),name="medicine_"+str(col)) for col  in lt]
layout=go.Layout(title="Monthly sales of top 5 medicines in Mumbai in the year 2017",xaxis={"title":"months"},yaxis=dict(title="sales"))
fig=go.Figure(data=data,layout=layout)
fig.show()
  • Some of the medicines look seasonal: their sales start around the middle of the year, and most sales come from the 2nd half of the year only
  • Let us double-check with the aggregated monthly sales in Mumbai

Let us see the monthly sales pattern for Mumbai in the year 2017

In [32]:
mum_city_sales=mumbai[mumbai["year"]==2017].groupby(["month"])["sales"].sum().sort_values(ascending=True).reset_index()
fig5 = px.bar(mum_city_sales, x="month", y="sales",  barmode='group',title="Aggregated monthly sales in Mumbai in the year 2017")
fig5.show()
  • This confirms our inference: sales before the 6th month are much lower than in the rest of the year

Let us check if the same pattern is there in top 4 other cities in the year 2017

Delhi 2017

In [33]:
d2017=delhi[delhi["year"]==2017]#d2017=delhi 2017,ltd=list_delhi
ltd=list(d2017.groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:5].index)     
data=[go.Bar(x=list(d2017[d2017["medicine"]==col].groupby(["month"])["sales"].sum().index),y=list(d2017[d2017["medicine"]==col].groupby(["month"])["sales"].sum().values),name="medicine_"+str(col)) for col  in ltd]
layout=go.Layout(title="Monthly sales of top 5 medicines in Delhi in the year 2017",xaxis={"title":"months"},yaxis=dict(title="sales"))
fig=go.Figure(data=data,layout=layout)
fig.show()
  • Even in Delhi we observe the same pattern, with more sales in the 2nd half of the year 2017

Bangalore 2017

In [34]:
b2017=Bangalore[Bangalore["year"]==2017]#b2017=Bangalore 2017,ltb=list_bangalore
ltb=list(b2017.groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:5].index)     
data=[go.Bar(x=list(b2017[b2017["medicine"]==col].groupby(["month"])["sales"].sum().index),y=list(b2017[b2017["medicine"]==col].groupby(["month"])["sales"].sum().values),name="medicine_"+str(col)) for col  in ltb]
layout=go.Layout(title="Monthly sales of top 5 medicines in Bangalore in the year 2017",xaxis={"title":"months"},yaxis=dict(title="sales"))
fig=go.Figure(data=data,layout=layout)
fig.show()
  • In Bangalore the pattern is even more pronounced, with more sales in the 2nd half of the year 2017

Kolkata 2017

In [35]:
k2017=Kolkata[Kolkata["year"]==2017]#k2017=Kolkata 2017,ltk=list_kolkata
ltk=list(k2017.groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:5].index)     
data=[go.Bar(x=list(k2017[k2017["medicine"]==col].groupby(["month"])["sales"].sum().index),y=list(k2017[k2017["medicine"]==col].groupby(["month"])["sales"].sum().values),name="medicine_"+str(col)) for col  in ltk]
layout=go.Layout(title="Monthly sales of top 5 medicines in Kolkata in the year 2017",xaxis={"title":"months"},yaxis=dict(title="sales"))
fig=go.Figure(data=data,layout=layout)
fig.show()
  • Even in Kolkata we find a similar pattern, with more sales in the 2nd half of the year 2017
  • We have also observed that medicine type 1077 is the top-selling medicine in all these cities except Delhi

So we can create one binary feature indicating whether a sale falls in the second half of 2017 (0 or 1); it may help the model
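A minimal sketch of that flag, on a toy frame standing in for the train data (the real frame already has `year` and `month`; the column name `h2_2017` is only illustrative):

```python
import pandas as pd

# Toy rows standing in for the train data
df = pd.DataFrame({"year": [2017, 2017, 2016], "month": [8, 3, 9]})

# 1 when the row falls in the second half of 2017, else 0
df["h2_2017"] = ((df["year"] == 2017) & (df["month"] > 6)).astype(int)
print(df["h2_2017"].tolist())  # → [1, 0, 0]
```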

Let us check whether we have same pattern for other years

Mumbai 2016
In [36]:
m2016=mumbai[mumbai["year"]==2016]#m2016=Mumbai 2016,ltm6=list_mumbai_2016
ltm6=list(m2016.groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:5].index)     
data=[go.Bar(x=list(m2016[m2016["medicine"]==col].groupby(["month"])["sales"].sum().index),y=list(m2016[m2016["medicine"]==col].groupby(["month"])["sales"].sum().values),name="medicine_"+str(col)) for col  in ltm6]
layout=go.Layout(title="Monthly sales of top 5 medicines in Mumbai in the year 2016",xaxis={"title":"months"},yaxis=dict(title="sales"))
fig=go.Figure(data=data,layout=layout)
fig.show()
  • We can see a different pattern here, with more sales in the 1st and 2nd months too
  • Let us check whether the same pattern exists in other cities
Delhi 2016
In [37]:
d2016=delhi[delhi["year"]==2016]#d2016=Delhi 2016,ltd6=list_Delhi_2016
ltd6=list(d2016.groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:5].index)     
data=[go.Bar(x=list(d2016[d2016["medicine"]==col].groupby(["month"])["sales"].sum().index),y=list(d2016[d2016["medicine"]==col].groupby(["month"])["sales"].sum().values),name="medicine_"+str(col)) for col  in ltd6]
layout=go.Layout(title="Monthly sales of top 5 medicines in Delhi in the year 2016",xaxis={"title":"months"},yaxis=dict(title="sales"))
fig=go.Figure(data=data,layout=layout)
fig.show()
  • We can clearly see the same sales pattern in Delhi as in Mumbai, i.e. more sales in months 1, 3, 7, 9, 10, 11, and 12

Kolkata 2016

In [38]:
k2016=Kolkata[Kolkata["year"]==2016]#k2016=Kolkata 2016,ltk6=list_Kolkata_2016
ltk6=list(k2016.groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:5].index)     
data=[go.Bar(x=list(k2016[k2016["medicine"]==col].groupby(["month"])["sales"].sum().index),y=list(k2016[k2016["medicine"]==col].groupby(["month"])["sales"].sum().values),name="medicine_"+str(col)) for col  in ltk6]
layout=go.Layout(title="Monthly sales of top 5 medicines in Kolkata in the year 2016",xaxis={"title":"months"},yaxis=dict(title="sales"))
fig=go.Figure(data=data,layout=layout)
fig.show()

The sales pattern looks similar across all the cities at a given time, i.e. the change in the number of sales moves together, though the range can differ

Inference: if we aggregate the sales per day for each city and divide each row's sales by that total, we get each medicine's contribution to that day's sales
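That ratio can be sketched with `groupby(...).transform` on a toy frame (the column name `sales_share` is hypothetical):

```python
import pandas as pd

# Toy rows: one city-day with three medicines, another city-day with one
df = pd.DataFrame({
    "city":     [1, 1, 1, 2],
    "date":     ["2017-01-02"] * 4,
    "medicine": [1, 2, 3, 1],
    "sales":    [10.0, 30.0, 60.0, 50.0],
})

# Total sales for that city on that day, broadcast back onto each row
city_day_total = df.groupby(["city", "date"])["sales"].transform("sum")
df["sales_share"] = df["sales"] / city_day_total
print(df["sales_share"].tolist())  # → [0.1, 0.3, 0.6, 1.0]
```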

Aggregated Yearly sales

In [39]:
year_cumulative = train.groupby(["year"])["sales"].sum()
year_cumulative = year_cumulative.reset_index()
year_cumulative["sales"]=np.round(year_cumulative["sales"])
fig = px.bar(year_cumulative, x="year", y="sales",  barmode='group',labels="sales",text="sales",title="Aggregated Yearly sales")
fig.show()
  • We can clearly see a trend in the yearly sales: the company's sales are increasing year by year

Let us look at the year 2015 and also try different plot styles

Delhi 2015

In [40]:
d2015=delhi[delhi["year"]==2015]#d2015=delhi 2015,ltd5=list_Delhi_2015
ltd5=list(d2015.groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].index)    
data=[go.Scatter(x=list(d2015[d2015["medicine"]==col].groupby(["month"])["sales"].sum().index),y=list(d2015[d2015["medicine"]==col].groupby(["month"])["sales"].sum().values),mode="lines+markers",name="medicine_"+str(col)) for col  in ltd5]
layout=go.Layout(title="Monthly sales of top 10 medicines in Delhi in the year 2015",xaxis={"title":"months"},yaxis=dict(title="sales"))
fig=go.Figure(data=data,layout=layout)
fig.show()
  • We can see some of the medicines share a similar pattern; they may be related drugs (e.g. antibiotics for the same condition)

Kolkata 2015

In [41]:
k2015=Kolkata[Kolkata["year"]==2015]#k2015=Kolkata 2015,ltk5=list_Kolkata_2015
ltk5=list(k2015.groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].index)
data=[go.Scatter(x=list(k2015[k2015["medicine"]==col].groupby(["month"])["sales"].sum().index),y=list(k2015[k2015["medicine"]==col].groupby(["month"])["sales"].sum().values),mode="markers",name="medicine_"+str(col)) for col  in ltk5]
layout=go.Layout(title="Monthly sales of top 10 medicines in Kolkata in the year 2015",xaxis={"title":"months"},yaxis=dict(title="sales"))
fig=go.Figure(data=data,layout=layout)
fig.show()
  • We can see some of the medicines are seasonal, such as the red markers (medicine 316)

Bangalore 2015

In [42]:
b2015=Bangalore[Bangalore["year"]==2015]#b2015=Bangalore 2015,ltb5=list_Bangalore_2015
ltb5=list(b2015.groupby(["medicine"])["sales"].sum().sort_values(ascending=False)[:10].index)    
data=[go.Scatter(x=list(b2015[b2015["medicine"]==col].groupby(["month"])["sales"].sum().index),y=list(b2015[b2015["medicine"]==col].groupby(["month"])["sales"].sum().values),mode="lines",name="medicine_"+str(col)) for col  in ltb5]
layout=go.Layout(title="Monthly sales of top 10 medicines in Bangalore in the year 2015",xaxis={"title":"months"},yaxis=dict(title="sales"))
fig=go.Figure(data=data,layout=layout)
fig.show()
  • Each city has some different types of medicines that sell well in that region

Let us observe the sales aggregated at year and month

In [44]:
yr_mnt=train.groupby(["year","month"])["sales"].sum().reset_index()
yr_mnt["year-month"]=yr_mnt.apply(lambda row:str(int(row["year"]))+"-"+str(int(row["month"])),axis=1)
In [45]:
fig5 = px.bar(yr_mnt, x="year-month", y="sales",  barmode='group',title="Aggregated sales for each month")
fig5.show()
  • Aggregating the total sales of all medicines at month level, we can see both an increasing trend and seasonality
  • When observed per medicine at city level, this does not hold for all medicines, and some medicines are seasonal
  • December has the highest sales of the entire year, so we can add one categorical column December (0 or 1)
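A minimal sketch of that December flag on a toy frame (the real frame already has `month`; the column name `december` is only illustrative):

```python
import pandas as pd

# Toy months column standing in for the train data
df = pd.DataFrame({"month": [12, 1, 12, 6]})

# 1 for December rows, else 0
df["december"] = (df["month"] == 12).astype(int)
print(df["december"].tolist())  # → [1, 0, 1, 0]
```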

Let us see for each month which city is having highest sales

Function returns aggregated sales for each month for top 5 cities
In [51]:
def cty_yr_mnt(df,cols):
    df1=pd.DataFrame()
    lt=["year","month"]
    for col in cols:
        df1[col]=df[df["city"]==col].groupby(["year", "month"])["sales"].sum().reset_index()["sales"]
    df1.columns=df1.columns.map(city)
    df1[lt[0]]=df[df["city"]==1].groupby(["year", "month"])["sales"].sum().reset_index()["year"]
    df1[lt[1]]=df[df["city"]==1].groupby(["year", "month"])["sales"].sum().reset_index()["month"]
    df1["year-month"]=df1.apply(lambda row:str(int(row[lt[0]]))+"-"+str(int(row[lt[1]])),axis=1)
    df1=df1.drop(lt,axis=1)
    return df1
In [52]:
lis=[4,5,7,1,6] # list of top 5 cities
city_month_sales=cty_yr_mnt(train,lis) # renamed to avoid shadowing the function
In [53]:
data = [go.Bar(x= city_month_sales['year-month'],y=city_month_sales[col],name=col) for col in list(city_month_sales.columns[:-1])]
layout=go.Layout(title="Aggregated monthly sales from top 5 cities")
fig=go.Figure(data,layout)
fig.show()
  • We can clearly see more randomness in the data when aggregated for each city at month level

Let us see the distribution of sales and footfall data aggregated at day level

In [54]:
dfs=pd.DataFrame(np.ceil(train.groupby(["date"])["footfall"].mean()/250))# dfs=date_footfall_sale_dataframe
dfs["sales"]=np.ceil(train.groupby(["date"])["sales"].mean())
dfs.reset_index()
fig = px.histogram(dfs, x=dfs["sales"],title="Distribution of sales")
fig.update_layout(xaxis_title="Sales data",yaxis_title="Count of sales")
fig.show()
  • It is a right-skewed distribution
In [55]:
 #dfs=date_footfall_sale_dataframe
fig = px.histogram(dfs, x=dfs["footfall"],title="Distribution of footfall")
fig.update_layout(xaxis_title="Footfall",yaxis_title="count of footfall")
fig.show()
  • It also looks like a right-skewed distribution; let us apply transformations

After Transformations

In [56]:
#dfs=date_footfall_sale_dataframe
fig = px.histogram(dfs, x=np.log1p(dfs["sales"]),title="Distribution of log transformed sales data")
fig.update_layout(xaxis_title="log transformed sales",yaxis_title="count of log transformed sales")
fig.show()
  • It looks close to a normal distribution
In [57]:
#dfs=date_footfall_sale_dataframe
fig = px.histogram(dfs, x=np.log1p(dfs["footfall"]),title="Distribution of log transformed footfall data")
fig.update_layout(xaxis_title="log transformed  Footfall",yaxis_title="count of log transformed footfall")
fig.show()
  • It also looks close to a normal distribution
  • The distributions show a similar pattern between footfall and sales, which we can compare more clearly in the graph below
In [58]:
#dfs=date_footfall_sale_dataframe
data = [go.Box(y=np.log1p(dfs["footfall"]),name='Footfall'),
        go.Box(y=np.log1p(dfs["sales"]),name='Sales')]
fig=go.Figure(data)
fig.show()
  • A few data points appear to be influential points for both footfall and sales

Let us look at the sales in a different way to check whether they follow a normal distribution

Q-Q plots before and after Transformations

In [59]:
import scipy.stats
import pylab
scipy.stats.probplot(train.groupby(["day","month","year"])["sales"].mean(), plot = pylab)
plt.title("QQ Plot", size = 24)
pylab.show()
  • The points deviate from the normality line at both ends and in the middle

After applying transformations

In [60]:
import scipy.stats
import pylab
scipy.stats.probplot(np.log1p(train.groupby(["day","month","year"])["sales"].mean()), plot = pylab)
plt.title("QQ Plot", size = 24)
pylab.show()
  • After applying the transformation, the data follows a normal distribution more closely than before

Let us see how the daily sales looks like

In [61]:
tr1=train.set_index("date") 
tr1['sales'].plot(figsize= (20,8), color = "blue") 
plt.show()
  • The daily series looks mostly random

Let us see the aggregated daily sales i.e sales grouped by date with rolling mean (as there is more randomness in daily data) and standard deviation

In [62]:
daily_aggregated_sales_log=np.log(dfs.sales)# dfs=date_footfall_sale_dataframe
moving_avg=daily_aggregated_sales_log.rolling(window=30).mean()
moving_std=daily_aggregated_sales_log.rolling(window=30).std()

daily_aggregated_sales_log.plot(label='Log Scale')
moving_avg.plot(label='moving_avg')
moving_std.plot(label='moving_std')
plt.legend(loc='best')
plt.show()
  • We could observe variance is almost constant and mean is also constant with slight variations

Decomposition

Let us decompose the sales data to inspect seasonality and trend

In [63]:
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(dfs['sales'].rolling(30, center=True).mean().dropna(), period=30)
plt.rcParams["figure.figsize"] = (20,20)
result.plot()
plt.show()
  • The residuals do not form any pattern, but we can see seasonality and trend in parts of the graphs; let us double-check with ACF and PACF plots
In [64]:
import statsmodels.api as sm
import statsmodels.tsa.api as smt
import statsmodels.formula.api as smf

fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(dfs.sales, lags=40, ax=ax1) #
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(dfs.sales, lags=40, ax=ax2)#
  • From the above graphs we can clearly see both trend and seasonality in the sales aggregated at day level

Let us double check with Dickey-Fuller Test:

In [65]:
from statsmodels.tsa.stattools import adfuller
def adf_test(timeseries):
    #Perform Dickey-Fuller test:
    print ('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)

#apply adf test on the series
adf_test(dfs['sales'])
Results of Dickey-Fuller Test:
Test Statistic                   -5.578271
p-value                           0.000001
#Lags Used                       22.000000
Number of Observations Used    1246.000000
Critical Value (1%)              -3.435609
Critical Value (5%)              -2.863862
Critical Value (10%)             -2.568006
dtype: float64
We can see the data is stationary as the p-value < 0.05 (using 22 lags); let us double-check with another test.
Null hypothesis in the Dickey-Fuller test: the data is non-stationary.
Alternate hypothesis: the data is stationary.
Since the p-value of about 0.000001 is below 0.01, we reject the null hypothesis even at 99% confidence, so the data is stationary.

KPSS Test

In [66]:
from statsmodels.tsa.stattools import kpss

def kpss_test(timeseries):
    #Perform Kwiatkowski-Phillips-Schmidt-Shin test(for trend stationarity):
    print ('Results of KPSS Test:')
    dftest = kpss(timeseries, regression="ct")
    dfoutput = pd.Series(dftest[0:3], index=['Test Statistic','p-value','#Lags Used'])
    for key,value in dftest[3].items():
        dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)

#apply kpss test on the series
kpss_test(dfs['sales'])
Results of KPSS Test:
Test Statistic            0.234536
p-value                   0.010000
#Lags Used               23.000000
Critical Value (10%)      0.119000
Critical Value (5%)       0.146000
Critical Value (2.5%)     0.176000
Critical Value (1%)       0.216000
dtype: float64
C:\Users\jayam\Anaconda3\lib\site-packages\statsmodels\tsa\stattools.py:1685: InterpolationWarning:

p-value is smaller than the indicated p-value

Null hypothesis in the KPSS test: the data is trend-stationary.
Alternate hypothesis: the data is not stationary around a trend.
The reported p-value of 0.01 (the warning indicates the true p-value is even smaller) is below 0.05, so we reject the null hypothesis of trend-stationarity. Combined with the ADF result, this suggests the series is difference-stationary rather than strictly trend-stationary.

Let us do the Dickey-Fuller test for medicines aggregated at day level

Build a function to return p values and take input as each medicine data aggregated at day level
In [67]:
from statsmodels.tsa.stattools import adfuller
def adf1_test(timeseries):
    #Perform Dickey-Fuller test:
    dftest1 = adfuller(timeseries, autolag='AIC')
    dfoutput1 = pd.Series(dftest1[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest1[4].items():
        dfoutput1['Critical Value (%s)'%key] = value
    return dfoutput1[1]

Build a function to return p-values taking input as our final data frame

In [68]:
def med_agg(df,meds):
    lis=[]
    for c in meds:
        #print(c)
        t=adf1_test(df[df["medicine"]==c].groupby(["date"])["sales"].mean().reset_index()["sales"])
        lis.append(t)
    df1=pd.DataFrame(lis,meds,columns=["p-values"])
    df1.index.name="medicine"
    return df1

The Dickey-Fuller test cannot be run on fewer than 12 rows of data, so let us check the value counts for the medicines

In [69]:
med_valcnt=pd.DataFrame(train["medicine"].value_counts())
med_valcnt.index.name="medicine"
med_valcnt.columns=["values"]
med_nopval=list(med_valcnt[med_valcnt["values"]<12].index)
len(med_nopval)
Out[69]:
82
  • There are 82 medicines on which we cannot run the Dickey-Fuller test due to lack of data

Let us run the test on the remaining medicines

In [70]:
med_p=med_agg(train,list(train["medicine"].value_counts().index)[:-82])
In [71]:
med_p[med_p["p-values"]<0.05].shape[0],med_p[med_p["p-values"]>0.05].shape[0]
Out[71]:
(2806, 501)
  • Per the Dickey-Fuller test, the data is stationary (p-value < 0.05) for 2806 medicine types, while there are 501 medicines for which the data is not stationary

Let us run the Dickey-Fuller test on the city_medicine combinations

Build a function to return p values and take input as each city_medicine data at day level
In [72]:
def med_city(df,cols):
    lis=[]
    for col in cols:
        t=adf1_test(df[df["city_medicine"]==col]["sales"])
        lis.append(t)
    df1=pd.DataFrame(lis,cols,columns=["p-values"])
    df1.index.name="city_medicine"
    return df1
In [73]:
city_med_valcnt=pd.DataFrame(train["city_medicine"].value_counts())
city_med_valcnt.index.name="city_medicine"
city_med_valcnt.columns=["values"]
city_med_nopval=list(city_med_valcnt[city_med_valcnt["values"]<12].index)
len(city_med_nopval)
Out[73]:
590
In [74]:
len(city_med_valcnt.index[:-590]),len(city_med_valcnt.index)
Out[74]:
(31957, 32547)
  • We would need to run the test on 31957 city_medicine types out of the 32547 combinations; that is a very large number, so we will run it in another file

city_med_p=med_city(train,list(city_med_valcnt.index[:-590]))

city_med_p[city_med_p["p-values"]<0.05].shape[0],city_med_p[city_med_p["p-values"]>0.05].shape[0]

Let us explore the test data set

Read the combined merged test dataset
In [75]:
test=pd.read_csv(path +"\\test_finaldata.csv", sep=",", na_values=["?",",","#","NaN","unknown",""])
Checking dtypes
In [76]:
test.dtypes 
Out[76]:
id                 int64
year               int64
month              int64
day                int64
city               int64
medicine           int64
date              object
discounted         int64
footfall         float64
year_month        object
city_medicine     object
week               int64
dtype: object

Let us check whether all combinations of city_medicine are present in the train dataset

In [77]:
test_bool=pd.DataFrame(test["city_medicine"].value_counts())

.isin checks whether each value in the index is present in the given list

In [78]:
test_bool["bool"]=test["city_medicine"].value_counts().index.isin(list(city_med_valcnt.index))
In [79]:
test_bool[test_bool["bool"]==False].shape[0]
Out[79]:
313
  • There are 313 city_medicine combinations that appear in the test data but not in the train dataset
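One simple way to handle those unseen combinations is a fallback to a coarser aggregate; a toy sketch (the `baseline_pred` column and the fallback-to-global-mean choice are assumptions, not the final model):

```python
import pandas as pd

train = pd.DataFrame({"city_medicine": ["1_1", "1_1", "1_2"],
                      "sales": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"city_medicine": ["1_1", "9_9"]})  # "9_9" unseen in train

combo_mean = train.groupby("city_medicine")["sales"].mean()
global_mean = train["sales"].mean()

# Map known combinations to their mean; unseen ones fall back to the global mean
test["baseline_pred"] = test["city_medicine"].map(combo_mean).fillna(global_mean)
print(test["baseline_pred"].tolist())  # → [15.0, 20.0]
```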

Let us check whether all medicine types are present in the train dataset

In [80]:
tmed=pd.DataFrame(test["medicine"].value_counts())
In [81]:
tmed["fbool"]=test["medicine"].value_counts().index.isin(list(train["medicine"].value_counts().index))
In [82]:
tmed[tmed["fbool"]==False].shape[0]
Out[82]:
34
  • There are 34 medicine types in the test data that are not in the train dataset

Let us build one simple ARIMA model on the daily aggregated sales

In [83]:
from statsmodels.tsa.arima_model import ARIMA
arima_mod = ARIMA(dfs.sales, order=(1,1,1)).fit(disp=0)
print(arima_mod.summary())
C:\Users\jayam\Anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:218: ValueWarning:

A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.

                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                D.sales   No. Observations:                 1268
Model:                 ARIMA(1, 1, 1)   Log Likelihood               -4948.885
Method:                       css-mle   S.D. of innovations             11.976
Date:                Sat, 19 Sep 2020   AIC                           9905.771
Time:                        12:11:16   BIC                           9926.352
Sample:                             1   HQIC                          9913.502
                                                                              
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -0.0011      0.010     -0.114      0.910      -0.020       0.018
ar.L1.D.sales     0.4333      0.027     15.894      0.000       0.380       0.487
ma.L1.D.sales    -0.9849      0.007   -132.139      0.000      -0.999      -0.970
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            2.3078           +0.0000j            2.3078            0.0000
MA.1            1.0153           +0.0000j            1.0153            0.0000
-----------------------------------------------------------------------------
  • Both the AR and MA coefficients are significant in this ARIMA(1,1,1) model